Fine-Grained Direct Preference Alignment Framework for Generative Visual Perception
XIE Tao1, YUAN Yuxuan1, ZUO Wangmeng2, LI Ruifeng1, ZHAO Lijun1
1. School of Mechatronics Engineering, Harbin Institute of Technology, Harbin 150001; 2. Faculty of Computing, Harbin Institute of Technology, Harbin 150001
Abstract: Generative referring segmentation methods built on multimodal large language models (MLLMs) are constrained by the supervised fine-tuning paradigm and leave ways of improving generation quality largely unexplored. Consequently, they suffer from semantic localization bias and coarse mask boundaries in complex scenes. To address these issues, a fine-grained direct preference alignment framework for generative visual perception (FG-DPA) is proposed. The direct preference optimization (DPO) algorithm is transferred from text understanding to the pixel-level segmentation task: preference pairs of high-quality and low-quality masks are constructed to steer the method toward more accurate visual representations in the latent space. Two types of negative samples are produced by exploiting the interactive nature of the segment anything model (SAM). To counter imprecise edges, adversarial point prompts are introduced inside the ground-truth bounding box, yielding low-quality masks with local omissions or overflows as boundary-level negatives. To counter incorrect target localization, non-overlapping masks are randomly sampled from the background region as semantic-level negatives. Through training with these samples, accurate segmentation is finally achieved in conjunction with SAM. Experiments on multiple public datasets show that FG-DPA effectively suppresses localization hallucination and significantly improves the completeness and edge accuracy of mask generation, validating its effectiveness in enhancing multimodal generative visual perception performance.
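The two core ingredients summarized above can be illustrated in a minimal sketch. This is not the paper's implementation: the function names, the point-sampling policy, and the per-pair scalar log-likelihoods are illustrative assumptions; the actual SAM prompting call and the token-level likelihoods of the mask sequence are omitted. It shows (a) sampling adversarial point prompts that lie inside the ground-truth box but off the ground-truth mask, so that an interactive segmenter such as SAM returns a flawed mask usable as the rejected half of a preference pair, and (b) the standard DPO loss applied to one such (preferred, rejected) pair.

```python
import math
import numpy as np

def adversarial_points(gt_mask, bbox, n_points=3, rng=None):
    """Sample point prompts inside the ground-truth box but OFF the
    ground-truth mask. Prompting an interactive segmenter (e.g. SAM)
    with such points tends to produce masks with local omissions or
    overflows, i.e. boundary-level negative examples.

    gt_mask: (H, W) boolean array; bbox: (x0, y0, x1, y1) in pixels.
    Returns an (n, 2) array of (x, y) coordinates.
    """
    rng = np.random.default_rng() if rng is None else rng
    x0, y0, x1, y1 = bbox
    # Background pixels restricted to the ground-truth box.
    ys, xs = np.nonzero(~gt_mask[y0:y1, x0:x1])
    idx = rng.choice(len(xs), size=min(n_points, len(xs)), replace=False)
    return np.stack([xs[idx] + x0, ys[idx] + y0], axis=1)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) mask pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where ref_* are log-likelihoods under the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # stable form of -log(sigmoid)
```

With a zero margin the loss equals log 2, and it decreases monotonically as the policy prefers the high-quality mask more strongly than the reference model does, which is what drives the alignment.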